Data Pipeline & Persistence Design
Overview
This document captures the design decisions made for the data ingestion pipeline, the persistence strategy, and the boundary between the synchronisation and analysis subsystems.
Note: The analysis subsystem and the ingestion layer will be documented in detail in subsequent documents.
System Decomposition
The system is divided into two separate subsystems with distinct responsibilities:
| Subsystem | Responsibility |
|---|---|
| Synchronisation | Fetch raw data, parse it, and persist it to the DB |
| Analysis | Execute queries, render charts, compose dashboards and reports |
The synchronisation subsystem produces tables in the DB, while the analysis subsystem consumes them.
Synchronisation Subsystem
Design Principle
Each new data type (e.g. finance or health) requires a new adapter. The role of this adapter is to persist raw input from a specified data source to a specific table in the DB. Note that the adapter is not responsible for fetching data from the data source; that is the responsibility of a separate layer.
The alternative, allowing runtime-defined transformation logic such as user-supplied SQL expressions evaluated during ingestion, would offer more flexibility when integrating new data types, but it introduces an unacceptable attack surface.
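As a rough illustration of this principle, the sketch below dispatches each data type to an adapter that is fixed in code, and rejects unknown types instead of evaluating any logic supplied at runtime. All names here (`ADAPTERS`, `ingest`, `finance_adapter`, `health_adapter`, and the in-memory `persisted` list standing in for the DB) are hypothetical and not part of the actual design.

```python
from typing import Any, Callable, Dict, List, Mapping, Tuple

# Hypothetical adapter signature: takes one raw input item and persists it.
Adapter = Callable[[Mapping[str, Any]], None]

# Stand-in for the DB so the sketch stays self-contained.
persisted: List[Tuple[str, dict]] = []


def finance_adapter(raw: Mapping[str, Any]) -> None:
    persisted.append(("finance", dict(raw)))


def health_adapter(raw: Mapping[str, Any]) -> None:
    persisted.append(("health", dict(raw)))


# The set of supported data types is closed: adding one means adding code here,
# not accepting transformation logic at runtime.
ADAPTERS: Dict[str, Adapter] = {
    "finance": finance_adapter,
    "health": health_adapter,
}


def ingest(data_type: str, raw: Mapping[str, Any]) -> None:
    """Route raw input to the adapter registered for its data type."""
    try:
        adapter = ADAPTERS[data_type]
    except KeyError:
        raise ValueError(f"no adapter registered for data type {data_type!r}") from None
    adapter(raw)
```

With this shape, supporting a new data type is a code change that can be reviewed, rather than an expression evaluated during ingestion.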
Adapter Contract
Each adapter takes a raw input, transforms it, and persists the result to the DB. An adapter is therefore composed of two parts:
- Mapper: maps the raw input into a specified domain record
- Repository: handles the interaction between the application and the DB
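The contract above could be sketched as follows. This is a minimal illustration only: it assumes a hypothetical "finance" data type, a raw input shaped like a dictionary with `date`, `amount`, and `category` keys, and an in-memory SQLite database standing in for the DB. The class names (`FinanceRecord`, `FinanceMapper`, `FinanceRepository`, `FinanceAdapter`) and the table layout are assumptions, not part of the design.

```python
import sqlite3
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class FinanceRecord:
    """Hypothetical domain record for a 'finance' data type."""
    occurred_on: str      # ISO date, e.g. "2024-01-31"
    amount_cents: int
    category: str


class FinanceMapper:
    """Mapper: maps one raw input item into a domain record."""

    def to_record(self, raw: Mapping[str, Any]) -> FinanceRecord:
        return FinanceRecord(
            occurred_on=str(raw["date"]),
            amount_cents=round(float(raw["amount"]) * 100),
            category=str(raw.get("category", "uncategorised")),
        )


class FinanceRepository:
    """Repository: owns all interaction with the DB for this table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS finance ("
            "occurred_on TEXT NOT NULL, "
            "amount_cents INTEGER NOT NULL, "
            "category TEXT NOT NULL)"
        )

    def save(self, record: FinanceRecord) -> None:
        # Values are bound as parameters, never interpolated into the SQL text.
        self._conn.execute(
            "INSERT INTO finance (occurred_on, amount_cents, category) "
            "VALUES (?, ?, ?)",
            (record.occurred_on, record.amount_cents, record.category),
        )
        self._conn.commit()


class FinanceAdapter:
    """Adapter: composes a mapper and a repository."""

    def __init__(self, mapper: FinanceMapper, repository: FinanceRepository) -> None:
        self._mapper = mapper
        self._repository = repository

    def ingest(self, raw: Mapping[str, Any]) -> None:
        self._repository.save(self._mapper.to_record(raw))


# Usage: the raw input arrives from a separate fetching layer (not shown).
conn = sqlite3.connect(":memory:")
adapter = FinanceAdapter(FinanceMapper(), FinanceRepository(conn))
adapter.ingest({"date": "2024-01-31", "amount": "12.50", "category": "groceries"})
rows = conn.execute(
    "SELECT occurred_on, amount_cents, category FROM finance"
).fetchall()
```

Keeping the mapper free of DB access means it can be tested with plain dictionaries, while the repository is the only place that knows the table layout.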
Analysis Subsystem
Design Principles
The analysis subsystem is modelled on three principles: it is SQL-first, query-centric, and decoupled from ingestion concerns. Users interact with DB tables directly through the UI by writing queries, defining transformation views, configuring charts, and composing dashboards.
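To make the SQL-first idea concrete, the sketch below defines a transformation view over a table and runs the kind of query a chart might be configured with. It assumes an in-memory SQLite database and a hypothetical `finance` table with made-up rows; the view name `monthly_spend` is also illustrative. In the real system the table would already have been produced by the synchronisation subsystem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for a table the synchronisation subsystem has already produced.
conn.execute(
    "CREATE TABLE finance (occurred_on TEXT, amount_cents INTEGER, category TEXT)"
)
conn.executemany(
    "INSERT INTO finance VALUES (?, ?, ?)",
    [
        ("2024-01-05", 1250, "groceries"),
        ("2024-01-20", 4000, "transport"),
        ("2024-02-03", 2300, "groceries"),
    ],
)

# A transformation view, as a user might define it through the UI.
conn.execute(
    """
    CREATE VIEW monthly_spend AS
    SELECT substr(occurred_on, 1, 7) AS month,
           category,
           SUM(amount_cents) AS total_cents
    FROM finance
    GROUP BY substr(occurred_on, 1, 7), category
    """
)

# The query behind a chart: it reads only from the view, never from ingestion code.
chart_rows = conn.execute(
    "SELECT month, category, total_cents FROM monthly_spend ORDER BY month, category"
).fetchall()
```

Nothing in this sketch refers to adapters, mappers, or data sources, which is the sense in which the analysis side stays decoupled from ingestion.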